library(tidyverse)
library(dplyr)
library(class)
library(cvTools)
library(randomForest)
library(GEOquery)
library(R.utils)
library(reshape2)
library(limma)
library(tuneR)
library(devtools)
library(ggplot2)
library(tsfeatures)
library(sf)
library(plotly)
library(DT)
library(e1071)
library(caret)
library(seewave)
# Import the CSV Data
df_reef <- read.csv("Reef_Check_with_cortad_variables_with_annual_rate_of_SST_change.csv")
dim(df_reef)
head(df_reef)
print(distinct_countries <- df_reef %>% distinct(Country) %>% nrow())
# Split the dataset by Average Bleaching (> 0 vs. 0)
bleach <- df_reef %>% filter(Average_bleaching > 0)
dim(bleach)
## [1] 3158 55
non_bleach <- df_reef %>% filter(Average_bleaching == 0)
dim(non_bleach)
## [1] 6057 55
p <- ggplot(bleach, aes(x=Average_bleaching, y=Year)) + geom_boxplot(aes(group=Year)) + ggtitle("Trend of Average Bleaching between 1998 and 2017")
p
The boxplot of Year versus Average Bleaching shows the average bleaching in different years. A trend of increasing average bleaching over time is suggested by the number of high outliers: there are more high outliers in recent years than in earlier years. 2017 appears to be an exceptional case, probably due to a lack of data from that year. The median values and spread of average bleaching are not very meaningful here because of the large number of outliers.
p <- ggplot(bleach, aes(x=Temperature_Mean, y=Year)) + geom_boxplot(aes(group=Year)) + ggtitle("Trend of Average Temperature at bleached locations between 1998 and 2017")
p
The boxplot of Year versus Average Temperature shows that the average temperature of the bleached locations did not change much over time; there is no clear trend indicating whether it increased or decreased. It would therefore be better to investigate the relationship between Average Bleaching and Average Temperature directly to gain more insight.
world_map <- map_data("world")
p <- ggplot() + geom_polygon(data = world_map, aes(x = long, y = lat, group = group),
fill = "grey", alpha = 0.3) + geom_point(data = df_reef, alpha = 0.2, aes(y = Latitude.Degrees,
x = Longitude.Degrees, size = Average_bleaching, color = Temperature_Mean)) + ggtitle("Average Bleaching with Average Temperature") + scale_color_gradient(low = "pink", high = "red") + theme_minimal()
ggplotly(p)
In the interactive geo-spatial plot above, the color scale indicates Average Temperature, with pink representing the lowest and red the highest values, and the dot size represents Average Bleaching, with larger dots indicating higher bleaching. Certain trends can be observed: locations with higher average temperature tend to show higher average bleaching. However, the plot can only reveal rough trends. To explore the actual effect of temperature on Average Bleaching in more depth, further analytical methods such as linear regression models are needed; relying solely on visualization is not sufficient.
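As a minimal sketch of such a follow-up (using the `bleach` dataframe defined above; this was not part of the original analysis), a simple linear regression of bleaching on mean temperature could quantify the trend the map suggests:

```r
# Hedged sketch: quantify the bleaching-temperature association suggested by the map
fit_lm <- lm(Average_bleaching ~ Temperature_Mean, data = bleach)
summary(fit_lm)  # sign of the slope and its p-value indicate direction and strength
```

A significant positive slope would support the visual impression; a multiple-regression model with additional covariates (e.g. depth or year) would be a natural next step.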
load("./GSE46474.RData")
gse
head(pData(gse)[,1:5]) # Patient Data (Observations)
head(fData(gse)[,1:5]) # Feature Data (Genes)
head(exprs(gse)[,1:5])
df_blood_wide = exprs(gse)
head(df_blood_wide)
dim(df_blood_wide)
df_blood <- as.data.frame(gse)
# Add the targeted feature "Outcome" to the dataframe
df_blood$Outcome <- ifelse(grepl("AR", gse$title), "Rejection", "Stable")
# Encode the binary Outcome as 1 and 0, where 1 represents Rejection and 0 represents Stable
df_blood$Outcome <- ifelse(df_blood$Outcome == "Rejection", 1, 0)
# The 1/0 representation should be a discrete factor, not numeric
df_blood$Outcome <- as.factor(df_blood$Outcome)
gse$Outcome <- ifelse(grepl("AR", gse$title), "Rejection", "Stable")
table(gse$Outcome)
##
## Rejection Stable
## 20 20
design <- model.matrix(~Outcome, data = pData(gse))
fit <- lmFit(exprs(gse), design)
fit <- eBayes(fit)
tT <- topTable(fit, n = Inf)
DT::datatable(round(tT[1:300, ], 2))
featureData <- fData(gse) # GSE objects contain a featureData slot with information about the sampled genes.
colnames(featureData)
## [1] "ID" "GB_ACC"
## [3] "SPOT_ID" "Species Scientific Name"
## [5] "Annotation Date" "Sequence Type"
## [7] "Sequence Source" "Target Description"
## [9] "Representative Public ID" "Gene Title"
## [11] "Gene Symbol" "ENTREZ_GENE_ID"
## [13] "RefSeq Transcript ID" "Gene Ontology Biological Process"
## [15] "Gene Ontology Cellular Component" "Gene Ontology Molecular Function"
blood_300 <- topTable(fit, n = 300, genelist = rownames(gse))
blood_300 <- merge(x=blood_300, y=featureData, by="ID", all.x=TRUE)
#head(blood_300)
load("./GSE138043.RData")
gse
head(pData(gse)[,1:5]) # Patient Data (Observations)
head(fData(gse)[,1:5]) # Feature Data (Genes)
head(exprs(gse)[,1:5])
df_biopsy_wide = exprs(gse)
head(df_biopsy_wide)
dim(df_biopsy_wide)
df_biopsy <- as.data.frame(gse)
df_biopsy$Outcome <- ifelse(grepl("non", gse$characteristics_ch1), "Stable", "Rejection")
df_biopsy$Outcome <- ifelse(df_biopsy$Outcome == "Rejection", 1, 0)
df_biopsy$Outcome <- as.factor(df_biopsy$Outcome)
gse$Outcome <- ifelse(grepl("non-AR", gse$characteristics_ch1), "Stable", "Rejection")
table(gse$Outcome)
##
## Rejection Stable
## 15 37
design <- model.matrix(~Outcome, data = pData(gse))
fit <- lmFit(exprs(gse), design)
fit <- eBayes(fit)
tT <- topTable(fit, n = Inf)
DT::datatable(round(tT[1:300, ], 2))
featureData <- fData(gse)
colnames(featureData)
## [1] "ID" "GB_LIST" "SPOT_ID" "seqname"
## [5] "RANGE_GB" "RANGE_STRAND" "RANGE_START" "RANGE_STOP"
## [9] "total_probes" "gene_assignment" "mrna_assignment" "category"
biopsy_300 <- topTable(fit, n = 300, genelist = rownames(gse))
biopsy_300 <- merge(x=biopsy_300, y=featureData, by="ID", all.x=TRUE)
#head(biopsy_300)
By merging the dataframes with their corresponding feature data on the common variable "ID", we can access the gene symbols. To preprocess the gene symbols in the biopsy dataframe, we need to extract the string between the first "//" and the second "//" and trim the whitespace at both ends. Once the biopsy dataframe is correctly reformatted, we can identify the gene symbols that appear in both the blood and biopsy dataframes.
There are three gene symbols common to the top-300 differential expression tables of the blood and biopsy datasets: "SLAMF6", "RASA3", and "ITGAX".
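A sketch of the reformatting step described above. The column names `gene_assignment` (biopsy) and `Gene Symbol` (blood) are assumed from the feature data printed earlier:

```r
# Hedged sketch: pull the biopsy gene symbol from between the first and second "//"
extract_symbol <- function(x) {
  parts <- strsplit(as.character(x), "//", fixed = TRUE)[[1]]
  if (length(parts) >= 2) trimws(parts[2]) else NA_character_
}
biopsy_300$Gene.Symbol <- vapply(biopsy_300$gene_assignment, extract_symbol, character(1))
# Overlapping symbols between the two top-300 tables
intersect(blood_300[["Gene Symbol"]], biopsy_300$Gene.Symbol)
```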
# Calculate variances for each numerical variable
blood_variances <- apply(df_blood[, -which(names(df_blood) == "Outcome")], 2, var, na.rm = TRUE)
biopsy_variances <- apply(df_biopsy[, -which(names(df_biopsy) == "Outcome")], 2, var, na.rm = TRUE)
# Sort variables by variance in descending order
sorted_blood_variances <- sort(blood_variances, decreasing = TRUE)
sorted_biopsy_variances <- sort(biopsy_variances, decreasing = TRUE)
# Select top 50 variables excluding Outcome which is the dependent variable
blood_selected <- names(sorted_blood_variances)[1:52]
blood_selected <- blood_selected[-c(1,2)]
biopsy_selected <- names(sorted_biopsy_variances)[1:50]
blood_selected <- c(blood_selected, "Outcome")
biopsy_selected <- c(biopsy_selected, "Outcome")
#blood_selected
#biopsy_selected
df_blood_b <- df_blood[blood_selected]
df_biopsy_b <- df_biopsy[biopsy_selected]
dim(df_blood_b)
## [1] 40 51
dim(df_biopsy_b)
## [1] 52 51
As for 2b, my feature selection approach is to choose the 50 features with the highest variances, because a gene with high variance potentially offers a greater chance of differentiating Stable from Rejection outcomes after kidney transplant.
Selecting high-variance features is a feature selection method supported by the following research papers:
- "Feature selection for high-dimensional genomic data" by John D. Storey, Robert Tibshirani, and Joel W. Burdick. This paper discusses the use of variance-based feature selection methods for gene expression data.
- "Feature selection for gene expression data: a comprehensive review" by Jiajia Chen, Jinyan Li, and Jieping Ye. This paper provides a comprehensive review of feature selection methods for gene expression data, including variance-based methods.
- "Selecting informative genes from microarray data using feature selection methods" by Hui Liu, Zhenqiu Liu, and Xiang-Sun Zhang. This paper presents a comparative study of various feature selection methods for gene expression data, including variance-based methods.
- "A comparison of feature selection methods for classification of high-dimensional microarray data" by L. A. Shipp, T. J. Ross, P. Tamayo, A. P. Weng, J. L. Kutok, R. C. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G. S. Pinkus, T. S. Ray, M. A. Koval, D. T. Lastavica, A. Norton, T. A. Lister, J. Mesirov, D. S. Neuberg, E. S. Lander, J. C. Aster, and T. R. Golub. This paper compares the performance of various feature selection methods for gene expression data, including variance-based methods, in the context of cancer classification.
The reason I did not use the top 50 genes from the topTable as the feature selection was the context of the question: I read "differentially expressed" as an ordinary adjective rather than the technical term. Therefore, after reading the research papers referenced above, I decided to use the top 50 genes with the highest variances as the most "differentially expressed" genes.
blood_selected
## [1] "NA.7537" "NA.4459" "NA.4461" "X201909_at"
## [5] "X204141_at" "X1562307_at" "X204439_at" "X1565743_at"
## [9] "X204409_s_at" "NA.22486" "X202411_at" "NA.24344"
## [13] "X206632_s_at" "X203153_at" "X1562283_at" "X209728_at"
## [17] "X214131_at" "X213831_at" "X1559500_at" "X1556657_at"
## [21] "X1555745_a_at" "NA.21740" "X1561754_at" "X213797_at"
## [25] "NA.1620" "X205000_at" "NA.11463" "X1565776_at"
## [29] "NA.12727" "NA.12278" "X216813_at" "X1556658_a_at"
## [33] "NA.10586" "X1557512_at" "X209480_at" "X214218_s_at"
## [37] "NA.7818" "NA.6569" "X1561615_s_at" "X1560800_at"
## [41] "X207815_at" "X1554638_at" "X214453_s_at" "X205950_s_at"
## [45] "X204351_at" "NA.12823" "X213446_s_at" "X1554543_at"
## [49] "X206207_at" "NA.14423" "Outcome"
biopsy_selected
## [1] "X3471427" "X2514413" "X2597389" "X3428596" "X3389954" "X3745351"
## [7] "X4028512" "X4031136" "X2513758" "X3656151" "X2459837" "X4048265"
## [13] "X4002394" "X2589291" "X3871095" "X2772566" "X3907456" "X4048241"
## [19] "X3299970" "X3745287" "X3617719" "X2581000" "X2676167" "X2880422"
## [25] "X2773972" "X2620937" "X2830465" "X3865275" "X2362991" "X2563785"
## [31] "X2773947" "X3142381" "X2903219" "X3317117" "X2772805" "X4030162"
## [37] "X3307580" "X3443226" "X3996598" "X3322700" "X2748542" "X3938750"
## [43] "X2967276" "X2790368" "X2773958" "X4030063" "X2692883" "X3907190"
## [49] "X2440413" "X3982612" "Outcome"
Above are the 50 features selected from the entire blood and biopsy datasets, respectively. Selecting features before splitting the data into training and testing sets is the defining characteristic of Framework 1.
# Randomly split data into training and testing sets in 80:20 ratio
set.seed(3888)
#set.seed(35)
training_indices <- createDataPartition(df_blood_b$Outcome, p = 0.8, list = FALSE)
training_set <- df_blood_b[training_indices, ]
testing_set <- df_blood_b[-training_indices, ]
# SVM model with Linear Kernel
svm_blood_b <- svm(Outcome ~ ., data = training_set, kernel = "linear")
# The predicted results
print(predicted <- predict(svm_blood_b, newdata = testing_set))
## GSM1130812 GSM1130823 GSM1130824 GSM1130831 GSM1130833 GSM1130836 GSM1130837
## 0 1 0 1 0 1 1
## GSM1130838
## 1
## Levels: 0 1
# Accuracy of the SVM model, as indicated by the testing set
print(accuracy <- mean(predicted == testing_set$Outcome))
## [1] 0.375
# 50 repeats of 5-fold cross-validation
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 50)
# Train the SVM model using the repeated cross-validation
svm_linear <- train(Outcome ~ ., data = df_blood_b, method = "svmLinear", trControl = ctrl, tuneLength = 5)
#svm_polynomial <- train(Outcome ~ ., data = df_blood_b, method = "svmPoly", trControl = ctrl, tuneLength = 10)
#svm_radial <- train(Outcome ~ ., data = df_blood_b, method = "svmRadial", trControl = ctrl, tuneLength = 10)
# Model Results
svm_linear
## Support Vector Machines with Linear Kernel
##
## 40 samples
## 50 predictors
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 50 times)
## Summary of sample sizes: 32, 32, 32, 32, 32, 32, ...
## Resampling results:
##
## Accuracy Kappa
## 0.65 0.3
##
## Tuning parameter 'C' was held constant at a value of 1
#svm_polynomial
#svm_radial
# Set up cross-validation parameters
n_repeats <- 50
n_folds <- 5
train_control <- trainControl(method = "repeatedcv", number = n_folds, repeats = n_repeats)
# Create empty vector to store accuracies
accuracies <- rep(0, n_repeats)
# Perform cross-validation and store accuracies
for (i in 1:n_repeats) {
set.seed(i) # set seed for reproducibility
svm_model <- train(Outcome ~ ., data = df_blood_b, method = "svmLinear", trControl = train_control)
accuracies[i] <- svm_model$results$Accuracy[1]
}
# Plot results as boxplots and histograms
par(mfrow=c(1,2))
boxplot(accuracies, main="Accuracy", ylab="Value")
hist(accuracies, main="Accuracy")
In the 50 iterations of 5-fold cross-validation, the SVM model with a linear kernel resulted in an average accuracy of 66%, as shown by the box plot and histogram. This level of accuracy is not considered good performance. Using SVM models with polynomial or radial kernels might lead to better performance since they offer more complexity, but they require more computational power to train and test.
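As a rough sketch of that alternative, the commented-out calls above could be run as follows. This mirrors the caret calls already used (with `ctrl` and `df_blood_b` defined above); results are not shown here and the runtime may be substantially longer:

```r
# Hedged sketch: radial-kernel SVM under the same 50x5 repeated CV scheme.
# tuneLength controls how large a grid of cost values caret explores.
svm_radial <- train(Outcome ~ ., data = df_blood_b, method = "svmRadial",
                    trControl = ctrl, tuneLength = 5)
svm_radial  # compare Accuracy/Kappa against the linear-kernel results
```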
set.seed(3888)
training_indices <- createDataPartition(df_biopsy_b$Outcome, p = 0.8, list = FALSE)
training_set <- df_biopsy_b[training_indices, ]
testing_set <- df_biopsy_b[-training_indices, ]
# SVM model with Linear Kernel
svm_biopsy_b <- svm(Outcome ~ ., data = training_set, kernel = "linear")
# The predicted results
print(predicted <- predict(svm_biopsy_b, newdata = testing_set))
## GSM4097342 GSM4097345 GSM4097348 GSM4097351 GSM4097352 GSM4097361 GSM4097362
## 0 0 0 1 1 0 1
## GSM4097369 GSM4097376 GSM4097378
## 0 0 0
## Levels: 0 1
# Accuracy of the SVM model, as indicated by the testing set
print(accuracy <- mean(predicted == testing_set$Outcome))
## [1] 0.8
# 50 repeats of 5-fold cross-validation
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 50)
# Train the SVM model using the repeated cross-validation
svm_linear <- train(Outcome ~ ., data = df_biopsy_b, method = "svmLinear", trControl = ctrl, tuneLength = 10)
#svm_polynomial <- train(Outcome ~ ., data = df_blood_b, method = "svmPoly", trControl = ctrl, tuneLength = 10)
#svm_radial <- train(Outcome ~ ., data = df_blood_b, method = "svmRadial", trControl = ctrl, tuneLength = 10)
# Model Results
svm_linear
## Support Vector Machines with Linear Kernel
##
## 52 samples
## 50 predictors
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 50 times)
## Summary of sample sizes: 41, 42, 42, 42, 41, 41, ...
## Resampling results:
##
## Accuracy Kappa
## 0.6809091 0.2667636
##
## Tuning parameter 'C' was held constant at a value of 1
#svm_polynomial
#svm_radial
# Set up cross-validation parameters
n_repeats <- 50
n_folds <- 5
train_control <- trainControl(method = "repeatedcv", number = n_folds, repeats = n_repeats)
# Create empty vector to store accuracies
accuracies <- rep(0, n_repeats)
# Perform cross-validation and store accuracies
for (i in 1:n_repeats) {
set.seed(i) # set seed for reproducibility
svm_model <- train(Outcome ~ ., data = df_biopsy_b, method = "svmLinear", trControl = train_control)
accuracies[i] <- svm_model$results$Accuracy[1]
}
# Plot results as boxplots and histograms
par(mfrow=c(1,2))
boxplot(accuracies, main="Accuracy", ylab="Value")
hist(accuracies, main="Accuracy")
In the 50 iterations of 5-fold cross-validation, the SVM model with a linear kernel resulted in an average accuracy of 69%, as shown by the box plot and histogram. This level of accuracy is also not considered good performance, but it is slightly better than the performance when predicting with the blood dataset.
Regarding Framework 1, the conclusion is that the biopsy dataset is better suited to predict the occurrence of rejection after a kidney transplant for a particular patient than the blood dataset.
set.seed(3888)
#set.seed(9527)
# Split the blood data-set into 80:20 ratio
trainIndex <- createDataPartition(df_blood$Outcome, p = 0.8, list = FALSE)
trainData <- df_blood[trainIndex, ]
testData <- df_blood[-trainIndex, ]
dim(trainData)
## [1] 32 54660
dim(testData)
## [1] 8 54660
# Feature selection on the split training set
# Calculate variances for each numerical variable
blood_variances <- apply(trainData[, -which(names(trainData) == "Outcome")], 2, var, na.rm = TRUE)
# Sort variables by variance in descending order
sorted_blood_variances <- sort(blood_variances, decreasing = TRUE)
# Select top 50 variables excluding Outcome which is the dependent variable
blood_selected <- names(sorted_blood_variances)[1:52]
blood_selected <- blood_selected[-c(1,2)]
blood_selected <- c(blood_selected, "Outcome")
trainData <- trainData[blood_selected]
testData <- testData[blood_selected]
dim(trainData)
## [1] 32 51
dim(testData)
## [1] 8 51
blood_selected
## [1] "NA.7537" "NA.4459" "NA.4461" "X201909_at"
## [5] "X202411_at" "X204439_at" "X204141_at" "X204409_s_at"
## [9] "NA.22486" "NA.24344" "X1565743_at" "X214131_at"
## [13] "X203153_at" "X1562307_at" "NA.1620" "X213831_at"
## [17] "X209728_at" "X1561754_at" "X213797_at" "X1556657_at"
## [21] "NA.11463" "X1555745_a_at" "X206632_s_at" "X205000_at"
## [25] "X214218_s_at" "NA.7818" "NA.10586" "NA.6569"
## [29] "X1556658_a_at" "X1562283_at" "X209480_at" "X204351_at"
## [33] "X214453_s_at" "X213446_s_at" "X1561615_s_at" "X216813_at"
## [37] "X1559500_at" "NA.12278" "NA.16356" "X205950_s_at"
## [41] "X207815_at" "NA.21740" "X1560800_at" "X209369_at"
## [45] "NA.12727" "X203911_at" "X1557512_at" "NA.14584"
## [49] "NA.12823" "X215666_at" "Outcome"
Above are the 50 features selected from the training set (80% of the entire dataset). Selecting features after splitting the data into training and testing sets is the defining characteristic of Framework 2.
# SVM model with Linear Kernel
svm_blood_c <- svm(Outcome ~ ., data = trainData, kernel = "linear")
# The predicted results
print(predicted <- predict(svm_blood_c, newdata = testData))
## GSM1130812 GSM1130823 GSM1130824 GSM1130831 GSM1130833 GSM1130836 GSM1130837
## 0 1 0 1 0 1 1
## GSM1130838
## 1
## Levels: 0 1
# Accuracy of the SVM model, as indicated by the testing set
print(accuracy <- mean(predicted == testData$Outcome))
## [1] 0.375
# 50 repeats of 5-fold cross-validation
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 50)
# Aggregate the pre-split training and testing sets for the following cross-validation
cv_blood <- rbind(trainData, testData)
# Train the SVM model using the repeated cross-validation
svm_linear <- train(Outcome ~ ., data = cv_blood, method = "svmLinear", trControl = ctrl, tuneLength = 5)
# Model Results
svm_linear
## Support Vector Machines with Linear Kernel
##
## 40 samples
## 50 predictors
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 50 times)
## Summary of sample sizes: 32, 32, 32, 32, 32, 32, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7095 0.419
##
## Tuning parameter 'C' was held constant at a value of 1
# Set up cross-validation parameters
n_repeats <- 50
n_folds <- 5
train_control <- trainControl(method = "repeatedcv", number = n_folds, repeats = n_repeats)
# Create empty vector to store accuracies
accuracies <- rep(0, n_repeats)
# Perform cross-validation and store accuracies
for (i in 1:n_repeats) {
set.seed(i) # set seed for reproducibility
svm_model <- train(Outcome ~ ., data = cv_blood, method = "svmLinear", trControl = train_control)
accuracies[i] <- svm_model$results$Accuracy[1]
}
# Plot results as boxplots and histograms
par(mfrow=c(1,2))
boxplot(accuracies, main="Accuracy", ylab="Value")
hist(accuracies, main="Accuracy")
In the 50 iterations of 5-fold cross-validation, the SVM model with a linear kernel resulted in an average accuracy of 70%, as shown by the box plot and histogram. This level of accuracy is not considered good performance.
set.seed(3888)
# Split the blood data-set into 80:20 ratio
trainIndex <- createDataPartition(df_biopsy$Outcome, p = 0.8, list = FALSE)
trainData <- df_biopsy[trainIndex, ]
testData <- df_biopsy[-trainIndex, ]
dim(trainData)
## [1] 42 16920
dim(testData)
## [1] 10 16920
# Feature selection on the split training set
# Calculate variances for each numerical variable
biopsy_variances <- apply(trainData[, -which(names(trainData) == "Outcome")], 2, var, na.rm = TRUE)
# Sort variables by variance in descending order
sorted_biopsy_variances <- sort(biopsy_variances, decreasing = TRUE)
# Select top 50 variables excluding Outcome which is the dependent variable
biopsy_selected <- names(sorted_biopsy_variances)[1:50]
biopsy_selected <- c(biopsy_selected, "Outcome")
trainData <- trainData[biopsy_selected]
testData <- testData[biopsy_selected]
dim(trainData)
## [1] 42 51
dim(testData)
## [1] 10 51
biopsy_selected
## [1] "X3471427" "X2514413" "X3428596" "X2597389" "X3389954" "X3745351"
## [7] "X4028512" "X4031136" "X3656151" "X2513758" "X4002394" "X2459837"
## [13] "X3907456" "X3871095" "X2589291" "X3299970" "X4048265" "X3745287"
## [19] "X4048241" "X2772566" "X3617719" "X2676167" "X2581000" "X2620937"
## [25] "X2830465" "X2880422" "X3865275" "X2362991" "X2773972" "X2563785"
## [31] "X2773947" "X3317117" "X2772805" "X2967276" "X3142381" "X3443226"
## [37] "X2903219" "X3307580" "X3907190" "X4030162" "X2692883" "X2741206"
## [43] "X2773958" "X3996598" "X3322700" "X3317056" "X2748542" "X4030063"
## [49] "X2731192" "X2400027" "Outcome"
Above are the 50 features selected from the training set (80% of the entire dataset). Selecting features after splitting the data into training and testing sets is the defining characteristic of Framework 2.
# SVM model with Linear Kernel
svm_biopsy_c <- svm(Outcome ~ ., data = trainData, kernel = "linear")
# The predicted results
print(predicted <- predict(svm_biopsy_c, newdata = testData))
## GSM4097342 GSM4097345 GSM4097348 GSM4097351 GSM4097352 GSM4097361 GSM4097362
## 0 0 0 1 1 0 0
## GSM4097369 GSM4097376 GSM4097378
## 0 0 0
## Levels: 0 1
# Accuracy of the SVM model, as indicated by the testing set
print(accuracy <- mean(predicted == testData$Outcome))
## [1] 0.9
# 50 repeats of 5-fold cross-validation
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 50)
# Aggregate the pre-split training and testing sets for the following cross-validation
cv_biopsy <- rbind(trainData, testData)
# Train the SVM model using the repeated cross-validation
svm_linear <- train(Outcome ~ ., data = cv_biopsy, method = "svmLinear", trControl = ctrl, tuneLength = 5)
# Model Results
svm_linear
## Support Vector Machines with Linear Kernel
##
## 40 samples
## 50 predictors
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 50 times)
## Summary of sample sizes: 32, 32, 32, 32, 32, 32, ...
## Resampling results:
##
## Accuracy Kappa
## 0.701 0.402
##
## Tuning parameter 'C' was held constant at a value of 1
# Set up cross-validation parameters
n_repeats <- 50
n_folds <- 5
train_control <- trainControl(method = "repeatedcv", number = n_folds, repeats = n_repeats)
# Create empty vector to store accuracies
accuracies <- rep(0, n_repeats)
# Perform cross-validation and store accuracies
for (i in 1:n_repeats) {
set.seed(i) # set seed for reproducibility
svm_model <- train(Outcome ~ ., data = cv_biopsy, method = "svmLinear", trControl = train_control)
accuracies[i] <- svm_model$results$Accuracy[1]
}
# Plot results as boxplots and histograms
par(mfrow=c(1,2))
boxplot(accuracies, main="Accuracy", ylab="Value")
hist(accuracies, main="Accuracy")
In 50 iterations of 5-fold cross-validation, the SVM model with a linear kernel resulted in an average accuracy of 68.5%, as shown by the box plot and histogram. This level of accuracy is not considered good performance.
# Create empty vector to store accuracies
accuracies_fw1_blood <- rep(0, n_repeats)
accuracies_fw1_biopsy <- rep(0, n_repeats)
accuracies_fw2_blood <- rep(0, n_repeats)
accuracies_fw2_biopsy <- rep(0, n_repeats)
# Perform cross-validation and store accuracies
for (i in 1:n_repeats) {
#set.seed(i) # set seed for reproducibility
svm_model <- train(Outcome ~ ., data = df_blood_b, method = "svmLinear", trControl = train_control)
accuracies_fw1_blood[i] <- svm_model$results$Accuracy[1]
}
for (i in 1:n_repeats) {
#set.seed(i) # set seed for reproducibility
svm_model <- train(Outcome ~ ., data = df_biopsy_b, method = "svmLinear", trControl = train_control)
accuracies_fw1_biopsy[i] <- svm_model$results$Accuracy[1]
}
for (i in 1:n_repeats) {
#set.seed(i) # set seed for reproducibility
svm_model <- train(Outcome ~ ., data = cv_blood, method = "svmLinear", trControl = train_control)
accuracies_fw2_blood[i] <- svm_model$results$Accuracy[1]
}
for (i in 1:n_repeats) {
#set.seed(i) # set seed for reproducibility
svm_model <- train(Outcome ~ ., data = cv_biopsy, method = "svmLinear", trControl = train_control)
accuracies_fw2_biopsy[i] <- svm_model$results$Accuracy[1]
}
compare_plots <- list(accuracies_fw1_blood, accuracies_fw1_biopsy, accuracies_fw2_blood, accuracies_fw2_biopsy)
boxplot(compare_plots, names = c("FW1 Blood", "FW1 Biopsy", "FW2 Blood", "FW2 Biopsy"), xlab = "Different data-set in different Frameworks", ylab = "Accuracy", main = "Compare different Accuracies")
This boxplot shows the aggregated performance results of 50 repeats of 5-fold cross-validation for four different cases: Framework 1 with Blood or Biopsy data and Framework 2 with Blood or Biopsy data.
According to the plot, Framework 2 appears more reliable than Framework 1, since the medians of the cross-validated accuracies under Framework 2 are higher than those under Framework 1. Under Framework 1, the biopsy data yielded the higher median accuracy, while the blood data yielded the lowest median accuracy of all four cases. In contrast, under Framework 2 the blood data yielded the highest median accuracy of all cases, while the biopsy data achieved acceptable accuracy that was slightly worse than under Framework 1. Overall, under Framework 2 the blood data predicts better than the biopsy data.
In general, we can conclude that among all four cases, an SVM model trained on the blood dataset under Framework 2 and evaluated by 50 repeats of 5-fold cross-validation performs best, with the highest median and maximum accuracy. However, given that kidney transplantation is an extremely serious procedure in which errors are intolerable, roughly 70% accuracy is far from sufficient for clinical decision support. Moreover, looking only at predicted accuracy is not enough: in this domain we must strongly avoid false negatives, i.e. predicting a stable outcome when rejection actually occurs. We should consciously increase recall, even at a reasonable cost in precision. Tuning the hyperparameters could help achieve higher recall and accuracy.
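As a sketch of how recall could be inspected alongside accuracy (using the `predicted` labels and `testData` from the most recent held-out evaluation above, with level "1" = Rejection treated as the positive class):

```r
# Hedged sketch: per-class metrics beyond accuracy.
# Sensitivity for the positive class "1" (Rejection) is the recall we want to
# maximise; a false negative here means predicting Stable when rejection occurs.
cm <- caret::confusionMatrix(predicted, testData$Outcome, positive = "1")
cm$byClass[c("Sensitivity", "Precision")]
```

Lowering the decision threshold (or adjusting class weights in the SVM) would be the usual lever for trading precision for recall.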
Using different feature selection methods could also improve the model. In both Frameworks 1 and 2, I selected the 50 highest-variance features as predictors, a method with support in the gene-analysis literature. An alternative would be to select the top 50 features from the topTable produced in 2a, i.e. the 50 features with the lowest p-values across the moderated t-tests. I did not use that method because of the context of the question: I read "differentially expressed" as an ordinary adjective rather than the technical term, and therefore, after reading the research papers referenced above, used the top 50 genes with the highest variances as the most "differentially expressed" genes.
set.seed(3888)
# Read in data
dir_short <- "zoe_spiker/Length3"
all_files_short <- list.files(dir_short)
wave_file_short <- list()
for (i in all_files_short) {
## cat(paste0(i, "...", "\n"))
wave_file_short[[i]] <- tuneR::readWave(file.path(dir_short, i))
}
dir_length8 <- "zoe_spiker/Length8"
all_files_length8 <- list.files(dir_length8)
wave_file_length8 <- list()
for (i in all_files_length8) {
## cat(paste0(i, "...", "\n"))
wave_file_length8[[i]] <- tuneR::readWave(file.path(dir_length8, i))
}
dir_long <- "zoe_spiker/Long"
all_files_long <- list.files(dir_long)
wave_file_long <- list()
for (i in all_files_long) {
## cat(paste0(i, "...", "\n"))
wave_file_long[[i]] <- tuneR::readWave(file.path(dir_long, i))
}
wave_file_agg <- append(append(wave_file_short, wave_file_length8), wave_file_long)
LR_detection = function(seq) {
  maxval = which.max(seq)  # index (position) of the signal peak
  minval = which.min(seq)  # index (position) of the signal trough
  # A peak occurring before the trough indicates a left movement, otherwise right
  movement = ifelse(maxval < minval, "L", "R")
  return(movement)
}
streaming_classifier <- function(wave_file, window_size = wave_file@samp.rate, increment = window_size/3, threshold_events = 30) {
# Extract left channel and time values
Y <- wave_file@left
xtime <- seq_len(length(Y)) / wave_file@samp.rate
# Initialize variables
predicted_labels <- character()
predicted_times <- numeric()
lower_interval <- 1
total_samples <- length(Y) # loop bound in samples, correct for any window_size
# Loop through intervals
while (total_samples > lower_interval + window_size) {
upper_interval <- lower_interval + window_size
interval <- Y[lower_interval:upper_interval]
# Compute test statistic
test_stat <- sum(interval[-length(interval)] * interval[-1] <= 0) # Zero-crossing
# Make prediction
if (test_stat < threshold_events) {
predicted <- LR_detection(interval)
predicted_labels <- c(predicted_labels, predicted)
predicted_times <- c(predicted_times, lower_interval + window_size / 2)
lower_interval <- lower_interval + window_size
} else {
lower_interval <- lower_interval + increment
}
}
# Return predicted labels
return(predicted_labels)
}
# Test the Streaming Classifier
streaming_classifier(wave_file_agg$LLL_z3.wav)
## [1] "L" "L" "R"
streaming_classifier(wave_file_agg$LLRLRLRL_z.wav)
## [1] "L" "R" "R" "R" "L" "R" "R" "R"
streaming_classifier(wave_file_agg$LLLRLLLRLRRLRRRLRLLL_Z.wav)
## [1] "R" "L" "L" "L" "L" "R" "L" "R" "R" "R" "R" "L" "L" "L" "R" "R" "R" "L" "R"
## [20] "R"
The streaming_classifier function is designed to classify eye movements in a streaming fashion. It takes as input a wave file containing audio data, along with parameters such as the window size, the increment, and the threshold for event detection.
It works by first extracting the left channel and the time values from the input wave file. It then initializes variables for storing predicted labels and times, as well as the lower and upper bounds of the current interval of the streaming data.
The function then loops through the signal, with the size of each interval defined by the window size parameter. Within each interval, it computes a test statistic based on the number of zero crossings. If the number of zero crossings is below the event-detection threshold, the function makes a prediction using the LR_detection function, stores the predicted label and time, and advances by a full window; otherwise, it skips ahead by the increment parameter. The LR_detection rule itself is very simple: it finds the positions of the maximum and minimum values within the interval and compares them. If the maximum (peak) occurs before the minimum (trough), a left movement is indicated; otherwise, a right movement.
Finally, the function returns the predicted labels for each interval.
In summary, the streaming classifier function is a streaming classifier that takes in audio data and makes predictions for eye movements by computing zero crossings and using a threshold to determine when to make predictions. The function is designed to work in a streaming fashion, where data is processed in real-time as it becomes available.
However, this streaming classifier is quite simplistic for such a complex project. As demonstrated by the three predictive samples above, the classifier has low accuracies.
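The mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the project's exact implementation: it takes a plain numeric signal plus a sample rate rather than a Wave object, and the default window, increment, and threshold values, as well as the helper LR_detect, are assumptions for demonstration.

```r
# Illustrative sketch of the zero-crossing streaming classifier.
# Simple LR rule: compare |max| and |min| of the signal over the interval.
LR_detect <- function(x) {
  if (abs(max(x)) < abs(min(x))) "L" else "R"
}

streaming_classifier_sketch <- function(signal, fs, window = 1,
                                        increment = 0.25, threshold = 20) {
  labels <- character(0)
  win_len <- round(window * fs)     # interval size in samples
  step <- round(increment * fs)     # how far to slide when no event is found
  lower <- 1
  while (lower + win_len - 1 <= length(signal)) {
    chunk <- signal[lower:(lower + win_len - 1)]
    # Count zero crossings: sign changes between consecutive samples.
    zc <- sum(diff(sign(chunk)) != 0)
    if (zc < threshold) {
      # Few zero crossings -> low-frequency event -> classify it.
      labels <- c(labels, LR_detect(chunk))
      lower <- lower + win_len      # skip past the detected event
    } else {
      lower <- lower + step         # slide forward and keep scanning
    }
  }
  labels
}
```

For example, a slow, purely negative one-second bump is labelled "L", because its maximum magnitude is smaller than its minimum magnitude and it produces almost no zero crossings.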
True = 0
False = 0
y_labels <- y_labels[1:24]
results_list <- results_list[1:24]
for (i in 1:length(y_labels)){
if (results_list[[i]] == y_labels[[i]]){
True = True + 1
}else{
False = False + 1
}
print(paste("Compare the predicted results: ", results_list[[i]]))
print(paste("with the actual results: ", y_labels[[i]]))
}
## [1] "Compare the predicted results: LL"
## [1] "with the actual results: LLL"
## [1] "Compare the predicted results: LLR"
## [1] "with the actual results: LLL"
## [1] "Compare the predicted results: LLR"
## [1] "with the actual results: LLL"
## [1] "Compare the predicted results: R"
## [1] "with the actual results: LLR"
## [1] "Compare the predicted results: LLR"
## [1] "with the actual results: LLR"
## [1] "Compare the predicted results: LRR"
## [1] "with the actual results: LLR"
## [1] "Compare the predicted results: LL"
## [1] "with the actual results: LRL"
## [1] "Compare the predicted results: LRL"
## [1] "with the actual results: LRL"
## [1] "Compare the predicted results: LRL"
## [1] "with the actual results: LRL"
## [1] "Compare the predicted results: RLLR"
## [1] "with the actual results: LRR"
## [1] "Compare the predicted results: LRL"
## [1] "with the actual results: LRR"
## [1] "Compare the predicted results: LRL"
## [1] "with the actual results: LRRz"
## [1] "Compare the predicted results: RLL"
## [1] "with the actual results: RLL"
## [1] "Compare the predicted results: RRL"
## [1] "with the actual results: RLL"
## [1] "Compare the predicted results: RR"
## [1] "with the actual results: RLL"
## [1] "Compare the predicted results: RLL"
## [1] "with the actual results: RLR"
## [1] "Compare the predicted results: LLR"
## [1] "with the actual results: RLR"
## [1] "Compare the predicted results: RLRL"
## [1] "with the actual results: RLR"
## [1] "Compare the predicted results: RLR"
## [1] "with the actual results: RRL"
## [1] "Compare the predicted results: RL"
## [1] "with the actual results: RRL"
## [1] "Compare the predicted results: RRL"
## [1] "with the actual results: RRL"
## [1] "Compare the predicted results: RRR"
## [1] "with the actual results: RRR"
## [1] "Compare the predicted results: RLRR"
## [1] "with the actual results: RRR"
## [1] "Compare the predicted results: RRLR"
## [1] "with the actual results: RRR"
print(paste("Total amount of correct prediction on length 3 signals by the streaming classifier: ", True))
## [1] "Total amount of correct prediction on length 3 signals by the streaming classifier: 6"
print(paste("Total amount of wrong prediction on length 3 signals by the streaming classifier: ", False))
## [1] "Total amount of wrong prediction on length 3 signals by the streaming classifier: 18"
print(paste("The accuracy of the streaming classifier by predicting length 3 signals is: ", True/length(y_labels)))
## [1] "The accuracy of the streaming classifier by predicting length 3 signals is: 0.25"
As the metrics above show, the streaming classifier correctly predicted six of the 24 files with three movements and wrongly predicted the other 18. Its accuracy on length-3 signals is therefore only 25%, which is extremely poor. As previously mentioned, the classifier is not complex enough for this signal classification task. In addition, its "hyperparameters", such as the window size, threshold, and increment value, could be tuned more carefully.
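One way to tune these hyperparameters would be a simple grid search, scoring each combination of window size, increment, and threshold by exact-match accuracy on the labelled files. The sketch below is an assumption about how this could look, not code from the project: tune_streaming, run_classifier, and the candidate grids are hypothetical names and values.

```r
# Hypothetical grid search over the streaming classifier's hyperparameters.
# `run_classifier(wave, window, increment, threshold)` is assumed to return
# a character vector of predicted labels for one file.
tune_streaming <- function(wave_files, labels, run_classifier,
                           windows = c(0.5, 1, 2),
                           increments = c(0.1, 0.25, 0.5),
                           thresholds = c(10, 25, 50)) {
  grid <- expand.grid(window = windows, increment = increments,
                      threshold = thresholds)
  grid$accuracy <- NA_real_
  for (i in seq_len(nrow(grid))) {
    # Collapse each file's predicted labels into one string, as in the
    # evaluation loops above, then compare against the true label strings.
    preds <- vapply(wave_files, function(w) {
      paste(run_classifier(w, grid$window[i], grid$increment[i],
                           grid$threshold[i]), collapse = "")
    }, character(1))
    grid$accuracy[i] <- mean(preds == labels)
  }
  grid[order(-grid$accuracy), ]   # best combinations first
}
```

The returned data frame ranks hyperparameter combinations by accuracy, so the top row gives the best setting found on the labelled data.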
# create a list to store the results
results_list <- list()
# iterate through each variable in wave_file_agg
for (filename in names(wave_file_agg)) {
# apply the streaming_classifier_2() function to the current file
result <- streaming_classifier_2(wave_file_agg[[filename]])
# store the result in the results_list
results_list[[filename]] <- result
}
results_list <- unname(results_list)
for(i in 1:length(results_list)){
results_list[[i]] <- paste(results_list[[i]], collapse = "")
}
True = 0
False = 0
y_labels <- y_labels[1:24]
results_list <- results_list[1:24]
for (i in 1:length(y_labels)){
if (results_list[[i]] == y_labels[[i]]){
True = True + 1
}else{
False = False + 1
}
print(paste("Compare the predicted results: ", results_list[[i]]))
print(paste("with the actual results: ", y_labels[[i]]))
}
## [1] "Compare the predicted results: L"
## [1] "with the actual results: LLL"
## [1] "Compare the predicted results: LR"
## [1] "with the actual results: LLL"
## [1] "Compare the predicted results: LLL"
## [1] "with the actual results: LLL"
## [1] "Compare the predicted results: LRRL"
## [1] "with the actual results: LLR"
## [1] "Compare the predicted results: LLL"
## [1] "with the actual results: LLR"
## [1] "Compare the predicted results: LRL"
## [1] "with the actual results: LLR"
## [1] "Compare the predicted results: RL"
## [1] "with the actual results: LRL"
## [1] "Compare the predicted results: RR"
## [1] "with the actual results: LRL"
## [1] "Compare the predicted results: LRL"
## [1] "with the actual results: LRL"
## [1] "Compare the predicted results: RRLRLR"
## [1] "with the actual results: LRR"
## [1] "Compare the predicted results: LRLL"
## [1] "with the actual results: LRR"
## [1] "Compare the predicted results: LLRL"
## [1] "with the actual results: LRRz"
## [1] "Compare the predicted results: LRRL"
## [1] "with the actual results: RLL"
## [1] "Compare the predicted results: RRRLL"
## [1] "with the actual results: RLL"
## [1] "Compare the predicted results: RL"
## [1] "with the actual results: RLL"
## [1] "Compare the predicted results: LLR"
## [1] "with the actual results: RLR"
## [1] "Compare the predicted results: LLRL"
## [1] "with the actual results: RLR"
## [1] "Compare the predicted results: RRRL"
## [1] "with the actual results: RLR"
## [1] "Compare the predicted results: RLLRLL"
## [1] "with the actual results: RRL"
## [1] "Compare the predicted results: RLRRR"
## [1] "with the actual results: RRL"
## [1] "Compare the predicted results: RLLL"
## [1] "with the actual results: RRL"
## [1] "Compare the predicted results: RLLL"
## [1] "with the actual results: RRR"
## [1] "Compare the predicted results: LLRL"
## [1] "with the actual results: RRR"
## [1] "Compare the predicted results: LLLR"
## [1] "with the actual results: RRR"
print(paste("Total amount of correct prediction on length 3 signals by the streaming classifier 2: ", True))
## [1] "Total amount of correct prediction on length 3 signals by the streaming classifier 2: 2"
print(paste("Total amount of wrong prediction on length 3 signals by the streaming classifier 2: ", False))
## [1] "Total amount of wrong prediction on length 3 signals by the streaming classifier 2: 22"
print(paste("The accuracy of the streaming classifier 2 by predicting length 3 signals is: ", True/length(y_labels)))
## [1] "The accuracy of the streaming classifier 2 by predicting length 3 signals is: 0.0833333333333333"
# create a list to store the results
results_list <- list()
# iterate through each variable in wave_file_agg
for (filename in names(wave_file_agg)) {
# apply the streaming_classifier_3() function to the current file
result <- streaming_classifier_3(wave_file_agg[[filename]])
# store the result in the results_list
results_list[[filename]] <- result
}
results_list <- unname(results_list)
for(i in 1:length(results_list)){
results_list[[i]] <- paste(results_list[[i]], collapse = "")
}
True = 0
False = 0
y_labels <- y_labels[1:24]
results_list <- results_list[1:24]
for (i in 1:length(y_labels)){
if (results_list[[i]] == y_labels[[i]]){
True = True + 1
}else{
False = False + 1
}
print(paste("Compare the predicted results: ", results_list[[i]]))
print(paste("with the actual results: ", y_labels[[i]]))
}
## [1] "Compare the predicted results: LL"
## [1] "with the actual results: LLL"
## [1] "Compare the predicted results: LLL"
## [1] "with the actual results: LLL"
## [1] "Compare the predicted results: LLL"
## [1] "with the actual results: LLL"
## [1] "Compare the predicted results: RR"
## [1] "with the actual results: LLR"
## [1] "Compare the predicted results: LLR"
## [1] "with the actual results: LLR"
## [1] "Compare the predicted results: LLRR"
## [1] "with the actual results: LLR"
## [1] "Compare the predicted results: LLLL"
## [1] "with the actual results: LRL"
## [1] "Compare the predicted results: LRRL"
## [1] "with the actual results: LRL"
## [1] "Compare the predicted results: LRL"
## [1] "with the actual results: LRL"
## [1] "Compare the predicted results: RLRL"
## [1] "with the actual results: LRR"
## [1] "Compare the predicted results: LRR"
## [1] "with the actual results: LRR"
## [1] "Compare the predicted results: LRR"
## [1] "with the actual results: LRRz"
## [1] "Compare the predicted results: RRLLR"
## [1] "with the actual results: RLL"
## [1] "Compare the predicted results: RLL"
## [1] "with the actual results: RLL"
## [1] "Compare the predicted results: RLL"
## [1] "with the actual results: RLL"
## [1] "Compare the predicted results: RLRR"
## [1] "with the actual results: RLR"
## [1] "Compare the predicted results: RLR"
## [1] "with the actual results: RLR"
## [1] "Compare the predicted results: RLRL"
## [1] "with the actual results: RLR"
## [1] "Compare the predicted results: RLL"
## [1] "with the actual results: RRL"
## [1] "Compare the predicted results: RRR"
## [1] "with the actual results: RRL"
## [1] "Compare the predicted results: RRRL"
## [1] "with the actual results: RRL"
## [1] "Compare the predicted results: RRRR"
## [1] "with the actual results: RRR"
## [1] "Compare the predicted results: RRRRR"
## [1] "with the actual results: RRR"
## [1] "Compare the predicted results: RRRLR"
## [1] "with the actual results: RRR"
print(paste("Total amount of correct prediction on length 3 signals by the streaming classifier 3: ", True))
## [1] "Total amount of correct prediction on length 3 signals by the streaming classifier 3: 8"
print(paste("Total amount of wrong prediction on length 3 signals by the streaming classifier 3: ", False))
## [1] "Total amount of wrong prediction on length 3 signals by the streaming classifier 3: 16"
print(paste("The accuracy of the streaming classifier 3 by predicting length 3 signals is: ", True/length(y_labels)))
## [1] "The accuracy of the streaming classifier 3 by predicting length 3 signals is: 0.333333333333333"
# create a list to store the results
results_list <- list()
# iterate through each variable in wave_file_agg
for (filename in names(wave_file_agg)) {
# apply the streaming_classifier_4() function to the current file
result <- streaming_classifier_4(wave_file_agg[[filename]])
# store the result in the results_list
results_list[[filename]] <- result
}
results_list <- unname(results_list)
for(i in 1:length(results_list)){
results_list[[i]] <- paste(results_list[[i]], collapse = "")
}
True = 0
False = 0
y_labels <- y_labels[1:24]
results_list <- results_list[1:24]
for (i in 1:length(y_labels)){
if (results_list[[i]] == y_labels[[i]]){
True = True + 1
}else{
False = False + 1
}
print(paste("Compare the predicted results: ", results_list[[i]]))
print(paste("with the actual results: ", y_labels[[i]]))
}
## [1] "Compare the predicted results: LL"
## [1] "with the actual results: LLL"
## [1] "Compare the predicted results: LLR"
## [1] "with the actual results: LLL"
## [1] "Compare the predicted results: LLL"
## [1] "with the actual results: LLL"
## [1] "Compare the predicted results: R"
## [1] "with the actual results: LLR"
## [1] "Compare the predicted results: LLR"
## [1] "with the actual results: LLR"
## [1] "Compare the predicted results: LLR"
## [1] "with the actual results: LLR"
## [1] "Compare the predicted results: LLL"
## [1] "with the actual results: LRL"
## [1] "Compare the predicted results: LRL"
## [1] "with the actual results: LRL"
## [1] "Compare the predicted results: LRL"
## [1] "with the actual results: LRL"
## [1] "Compare the predicted results: RLRR"
## [1] "with the actual results: LRR"
## [1] "Compare the predicted results: LRR"
## [1] "with the actual results: LRR"
## [1] "Compare the predicted results: LRR"
## [1] "with the actual results: LRRz"
## [1] "Compare the predicted results: RLL"
## [1] "with the actual results: RLL"
## [1] "Compare the predicted results: RRL"
## [1] "with the actual results: RLL"
## [1] "Compare the predicted results: RLL"
## [1] "with the actual results: RLL"
## [1] "Compare the predicted results: RLR"
## [1] "with the actual results: RLR"
## [1] "Compare the predicted results: RLR"
## [1] "with the actual results: RLR"
## [1] "Compare the predicted results: RLRL"
## [1] "with the actual results: RLR"
## [1] "Compare the predicted results: RLL"
## [1] "with the actual results: RRL"
## [1] "Compare the predicted results: RRR"
## [1] "with the actual results: RRL"
## [1] "Compare the predicted results: RRRL"
## [1] "with the actual results: RRL"
## [1] "Compare the predicted results: RRR"
## [1] "with the actual results: RRR"
## [1] "Compare the predicted results: RRRR"
## [1] "with the actual results: RRR"
## [1] "Compare the predicted results: RRR"
## [1] "with the actual results: RRR"
print(paste("Total amount of correct prediction on length 3 signals by the streaming classifier 4: ", True))
## [1] "Total amount of correct prediction on length 3 signals by the streaming classifier 4: 12"
print(paste("Total amount of wrong prediction on length 3 signals by the streaming classifier 4: ", False))
## [1] "Total amount of wrong prediction on length 3 signals by the streaming classifier 4: 12"
print(paste("The accuracy of the streaming classifier 4 by predicting length 3 signals is: ", True/length(y_labels)))
## [1] "The accuracy of the streaming classifier 4 by predicting length 3 signals is: 0.5"
# Define custom values and labels
my_values <- c(0.25, 0.08, 0.33, 0.5)
my_labels <- c("Classifier 1", "Classifier 2", "Classifier 3", "Classifier 4")
# Create bar plot with custom labels
barplot(my_values, names.arg = my_labels, main = "Comparison of 4 streaming classifiers with different hyperparameters")
Four streaming classifiers were created and tested, all following the same rule for classifying left or right movements as event or non-event. The accuracies in the bar plot above show that Classifier 4, with an increment equal to the wave sample rate divided by 50 and an event threshold of 25, achieved the highest accuracy of the four: 50% on length-3 signals.
While 50% accuracy is not good performance, this was not a binary classification task, so 50% does not mean the classifier guessed the outcome at random. Our classifiers performed poorly, but their underlying mechanism is correct and logical.
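To put the 50% figure in context, a classifier that guessed each of the three movements uniformly at random would match the full length-3 sequence only (1/2)^3 = 12.5% of the time (this quick check ignores the variable-length predictions some classifiers produced):

```r
# Probability of matching a length-3 L/R sequence exactly by uniform guessing
(1 / 2)^3
## [1] 0.125
# Quick simulation check against an arbitrary target sequence
set.seed(2023)
guesses <- replicate(10000, paste(sample(c("L", "R"), 3, replace = TRUE), collapse = ""))
mean(guesses == "LRL")  # approximately 0.125
```

So Classifier 4's 50% exact-sequence accuracy is well above the random-guessing baseline.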
y_labels <- names(wave_file_agg)
# Keep only the part of each file name before the first underscore,
# which is the label string
remove_after_underscore <- function(x) {
sub("_.*", "", x)
}
y_labels <- lapply(y_labels, remove_after_underscore)
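The regular expression above keeps everything in a file name before the first underscore, which recovers the label string, for example:

```r
# sub() replaces the first match of "_.*" (an underscore and everything
# after it) with the empty string, leaving only the label prefix
sub("_.*", "", "LLLRLLLRLRRLRRRLRLLL_Z.wav")
## [1] "LLLRLLLRLRRLRRRLRLLL"
```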
# Use the classifier with the best performance to classify medium and long signals.
# create a list to store the results
results_list <- list()
# iterate through each variable in wave_file_agg
for (filename in names(wave_file_agg)) {
# apply the streaming_classifier_4() function to the current file
result <- streaming_classifier_4(wave_file_agg[[filename]])
# store the result in the results_list
results_list[[filename]] <- result
}
results_list <- unname(results_list)
for(i in 1:length(results_list)){
results_list[[i]] <- paste(results_list[[i]], collapse = "")
}
True = 0
False = 0
y_labels <- y_labels[25:31]
results_list <- results_list[25:31]
for (i in 1:length(y_labels)){
if (results_list[[i]] == y_labels[[i]]){
True = True + 1
}else{
False = False + 1
}
print(paste("Compare the predicted results: ", results_list[[i]]))
print(paste("with the actual results: ", y_labels[[i]]))
}
## [1] "Compare the predicted results: LRLRLRRLL"
## [1] "with the actual results: LLRLRLRL"
## [1] "Compare the predicted results: LLRRLLRR"
## [1] "with the actual results: LLRRLLLR"
## [1] "Compare the predicted results: LRRRLLR"
## [1] "with the actual results: LLRRRLLL"
## [1] "Compare the predicted results: RRRRLLRL"
## [1] "with the actual results: LRRRLLRL"
## [1] "Compare the predicted results: RRRLLRL"
## [1] "with the actual results: RRRLRLLR"
## [1] "Compare the predicted results: RLLRRLLLRLRRLRRRLRLRL"
## [1] "with the actual results: LLLRLLLRLRRLRRRLRLLL"
## [1] "Compare the predicted results: RRLRRLRLRLLRRRRRL"
## [1] "with the actual results: RRLRRLRLRLLLLLLRRLRL"
print(paste("Total amount of correct prediction on medium and long signals by the streaming classifier 4: ", True))
## [1] "Total amount of correct prediction on medium and long signals by the streaming classifier 4: 0"
print(paste("Total amount of wrong prediction on medium and long signals by the streaming classifier 4: ", False))
## [1] "Total amount of wrong prediction on medium and long signals by the streaming classifier 4: 7"
print(paste("The accuracy of the streaming classifier 4 by predicting medium and long signals is: ", True/length(y_labels)))
## [1] "The accuracy of the streaming classifier 4 by predicting medium and long signals is: 0"
As the results above show, the best of the four streaming classifiers on length-3 signals failed to exactly predict any of the movement sequences of length 8 or longer, giving zero exact-match accuracy. However, it correctly predicted many of the individual movements, and the lengths of its predicted sequences were close to the lengths of the actual sequences, which is an encouraging aspect of the classifier.
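Since exact-match accuracy gives no credit for nearly correct sequences, a softer metric such as the Levenshtein edit distance (available in base R via adist) would better reflect this behaviour. A sketch, applied to one predicted/actual pair from the output above (edit_similarity is a hypothetical helper, not part of the project code):

```r
# Normalised edit similarity between a predicted and an actual label string.
# adist() computes the Levenshtein (edit) distance between two strings.
edit_similarity <- function(pred, actual) {
  1 - adist(pred, actual)[1, 1] / max(nchar(pred), nchar(actual))
}
edit_similarity("RRRRLLRL", "LRRRLLRL")  # one substitution in eight characters
## [1] 0.875
```

Under this metric, a prediction that gets seven of eight movements right scores 0.875 instead of 0.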
Plotly R Open Source Graphing Library: instructions for creating an interactive geo-spatial plot with Plotly (the plot was eventually produced with "ggplotly" instead). Available at https://plotly.com/r/.
DATA3888_2023 Lab: Week 1: Partial sample codes for using 50 repeats 5-folds CV in Question 2b and 2c. Available at https://canvas.sydney.edu.au/files/29637976?fd_cookie_set=1.
DATA3888_2023 Lab: Week 2 - Biomedical Data Analysis: Partial sample codes for R.Data import, toptable formation, feature extraction, and SVM model in Question 2a, 2b, and 2c. Available at https://canvas.sydney.edu.au/courses/47683/pages/week-2-omics?module_item_id=1870825.
DATA3888_2023 Lab: Week 3 - Brain Box Data Analytics: Partial sample codes for wave data import, generating training data, movement extraction function, and the skeleton codes of streaming classifier while loops in Question 3a. Available at https://canvas.sydney.edu.au/courses/47683/pages/week-3-brainbox?module_item_id=1870824.
Liu, H., Liu, Z., & Zhang, X. S. (2009). Selecting informative genes from microarray data using feature selection methods. BMC bioinformatics, 10(Suppl 1), S6. The paper discusses the method of feature selection to select the top 50 most differentially expressed genes. Available at https://link.springer.com/article/10.1007/s10015-008-0534-4.
Datacamp: Support Vector Machine in R. Provides an understanding of the mechanism of SVM with linear and non-linear kernels. Available at https://www.datacamp.com/tutorial/support-vector-machines-r.
Techvidvan: SVM in R for Data Classification using e1071 Package. Provides skeleton codes for SVM model. Available at https://techvidvan.com/tutorials/svm-in-r/.
Stefanowski, J., & Brzezinski, D. (2008). Streaming Classification. The paper discusses the mechanism of a basic streaming classifier. Available at http://www.cs.put.poznan.pl/dbrzezinski/publications/Stream_Classification.pdf.
StackOverflow: Found solutions for several bugs in SVM model training and cross-validation.
GitHub: Found solutions for several bugs in streaming function and the streaming classifier evaluations, particularly solving the logical errors in the for loops and while loops.